Variant Impact Predictor database (VIPdb), version 2: Trends from 25 years of genetic variant impact predictors

Background: Variant interpretation is essential for identifying patients’ disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). Results: The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past 25 years, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 186 VIPs, resulting in a total of 403 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. Conclusions: VIPdb version 2 summarizes 403 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. Availability: VIPdb version 2 is available at https://genomeinterpretation.org/vipdb


Background
Advances in sequencing technologies, including gene panels, whole exome sequencing, whole genome sequencing, and long read sequencing, have revolutionized the investigation of genetic variation on a large scale, accelerating the discovery of novel genetic etiologies of disease and improving the efficiency of diagnosis (1,2). Typically, thousands to millions of variants are identified in each individual (3,4), making it challenging to distinguish disease-causing variants from noncontributory ones. Consequently, methods to predict whether variants are disease-causing are essential (5,6). This need prompted the development of Variant Impact Predictors (VIPs), tools or databases designed to predict the consequences of genetic variants. Hundreds of genetic VIPs have been developed, with a variety of methodologies and goals (7). Some overlapping categories of variants considered by different tools are single nucleotide variations (SNVs), insertions and deletions (indels), structural variations (SVs), nonsynonymous variants, synonymous variants, splicing variants, and regulatory variants. VIPs are also designed for different contexts, such as germline variants, somatic variants, or specific diseases or genes. The variety of VIPs underscores the complex nature of variant interpretation and poses a challenge for users in identifying the most suitable VIPs for their specific needs.
Many computational impact prediction methods have been developed, yet the field lacks a clear consensus on their appropriate use and interpretation (8). Recognizing the need for an organized approach to explore available VIPs, several research entities have constructed resources facilitating the informed use of VIPs. Initiatives like the Critical Assessment of Genome Interpretation (CAGI) conduct community experiments to assess VIPs across different variant types and contexts (8,9,10). The dbNSFP (database for Nonsynonymous Single-nucleotide polymorphisms' Functional Predictions) hosts precomputed results from several VIPs (11). OpenCRAVAT integrates hundreds of VIP analyses of cancer-related variants in one platform, enhancing accessibility for users (12). These resources have played an important role in introducing users to VIP options. Complementing these efforts, we developed VIPdb to serve as a comprehensive resource for exploring VIPs.
To systematically evaluate the pathogenicity of a variant in a clinical laboratory, ACMG/AMP has established guidelines for interpreting genetic variants that integrate several lines of evidence, including population data, functional data, segregation data, and computational prediction (13). Historically, VIPs provided only supporting evidence in determining the pathogenicity or benignity of variants in clinical settings. However, recent ClinGen clinical recommendations allow VIPs to provide stronger evidence (14). This greater role for VIPs in providing evidence for clinical decisions could improve genetic disease diagnosis.
To facilitate users' exploration of available VIPs, we described key features of each VIP.
VIPs primarily designed for variant impact prediction were labeled as primary. VIPs not originally designed for variant impact prediction but nonetheless used for this purpose, such as those estimating conservation scores and population allele frequencies, were categorized as non-primary. VIPs containing clinical classifications, functional data, or population data were categorized as databases, whereas VIPs utilizing databases for computing variant impact predictions were classified as non-databases. Furthermore, as VIPs are designed for different types of genetic variants, we classified the VIPs according to the following overlapping categories of input: single nucleotide variant (SNV), insertion and deletion (indel) variant, structural variant (SV), nonsynonymous/nonsense variant, synonymous variant, splicing variant, and regulatory region variant. Licensing information, including whether the VIP is free for academic or commercial use, was also included. In addition, we provided details about accessing VIPs, such as homepage links and source code availability.
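The categorization above can be pictured as one small record per VIP. The Python sketch below is illustrative only: the class, field names, and the example entry are hypothetical and do not reflect the actual VIPdb schema. The `is_core` rule follows the definition of core VIPs used in this work (primarily designed for variant impact prediction and not a database).

```python
from dataclasses import dataclass, field
from enum import Enum


class VariantType(Enum):
    """Overlapping input categories used to classify VIPs."""
    SNV = "single nucleotide variant"
    INDEL = "insertion and deletion variant"
    SV = "structural variant"
    NONSYNONYMOUS = "nonsynonymous/nonsense variant"
    SYNONYMOUS = "synonymous variant"
    SPLICING = "splicing variant"
    REGULATORY = "regulatory region variant"


@dataclass
class VipEntry:
    """Hypothetical record for one VIP; field names are assumptions."""
    name: str
    primary: bool                      # designed primarily for impact prediction
    database: bool                     # holds clinical, functional, or population data
    variant_types: set = field(default_factory=set)  # categories may overlap
    free_academic: bool = False
    free_commercial: bool = False
    homepage: str = ""
    source_code_available: bool = False

    @property
    def is_core(self) -> bool:
        # Core VIP: primary and not a database.
        return self.primary and not self.database


# Hypothetical entry, not a real VIP.
example = VipEntry(
    name="ExamplePredictor",
    primary=True,
    database=False,
    variant_types={VariantType.SNV, VariantType.NONSYNONYMOUS},
)
```

Storing the variant types as a set reflects that a single VIP can fall into several of the overlapping categories at once.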
In VIPdb version 2, we have made enhancements to inform clinical decision-making.
We incorporated calibrated threshold scores recommended by ClinGen for clinical use (14) with the ACMG/AMP guidelines for variant classification (13). Additionally, we included community assessment results from the CAGI 6 Annotate All Missense / Missense Marathon challenge (416) to enable users to compare the overall performance of methods and the performance on subsets with high specificity or high sensitivity.
To understand the trends of genetic VIPs over 25 years, we conducted a citation analysis. We utilized the Entrez module in Biopython to retrieve citation information from the PubMed database. Specifically, the elink function was employed to collect the articles citing each VIP, and the esummary function was used to collect the publication years of these citing articles. Together, these functions enabled the automatic collection of citation numbers by year for each VIP.
An analysis of the variant types handled by VIPs showed a predominant focus on predicting the impacts of single nucleotide variants (SNVs) and nonsynonymous variants (Fig. 1).
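The citation retrieval described above (elink to find citing articles, esummary to obtain their publication years) can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact script used for VIPdb: the `pubmed_pubmed_citedin` link name, the batch size, the function names, and the parsing of the `PubDate` field are our assumptions.

```python
from collections import Counter


def fetch_citing_pub_dates(pmid, email, batch_size=200):
    """Return PubMed 'PubDate' strings for the articles that cite `pmid`."""
    # Imported here so the pure helper below is usable without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves

    # elink: PubMed -> PubMed "cited in" links (link name is an assumption).
    handle = Entrez.elink(dbfrom="pubmed", db="pubmed", id=pmid,
                          linkname="pubmed_pubmed_citedin")
    linksets = Entrez.read(handle)
    handle.close()

    citing_ids = [
        link["Id"]
        for linkset in linksets
        for linkdb in linkset.get("LinkSetDb", [])
        for link in linkdb["Link"]
    ]

    # esummary: fetch summaries in batches and keep only the PubDate field.
    pub_dates = []
    for start in range(0, len(citing_ids), batch_size):
        batch = citing_ids[start:start + batch_size]
        handle = Entrez.esummary(db="pubmed", id=",".join(batch))
        summaries = Entrez.read(handle)
        handle.close()
        pub_dates.extend(str(doc["PubDate"]) for doc in summaries)
    return pub_dates


def count_by_year(pub_dates):
    """Tally citations per year from PubDate strings such as '2019 Jan 15'."""
    years = (date[:4] for date in pub_dates)
    counts = Counter(int(year) for year in years if year.isdigit())
    return dict(sorted(counts.items()))
```

Calling `count_by_year(fetch_citing_pub_dates(pmid, email))` with the PubMed ID of a VIP publication and a contact email would yield a year-to-count mapping for that VIP. Keeping the tally in a separate pure function allows it to be checked without network access.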
Since the 2010s, there has been a notable surge in the development of VIPs tailored for insertions and deletions (indels), while VIPs dedicated to predicting the impacts of splicing, structural, synonymous, and regulatory variants have grown more modestly (Fig. 1). These observations about VIP variant type not only highlight the field's current focus but also identify areas that have been less explored, suggesting potential directions for future research.
The citation rate of VIPs continues to rise, while the annual number of VIP publications has reached a plateau (Fig. 2). The increasing citation rates for both the 274 core VIPs and the 129 non-core VIPs reflect the ongoing growth of VIP usage (Fig. 2A). The median total citation count per VIP from 1998 to 2023 is 41, with a 95% quantile of 2612 citations (Fig. 2B). The annual number of VIP publications has stabilized, and some of these are follow-up publications to earlier work (Fig. 2C).
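Summary statistics of this kind (a median and a 95% quantile over per-VIP total citation counts) can be computed as in the sketch below. The function name is hypothetical, and the quantile convention (inclusive linear interpolation) is an assumption, since the convention used for Fig. 2B is not stated here.

```python
from statistics import median, quantiles


def summarize_total_citations(totals):
    """Return (median, 95% quantile) of per-VIP total citation counts."""
    # quantiles(n=20) yields the 5%, 10%, ..., 95% cut points; take the last.
    p95 = quantiles(totals, n=20, method="inclusive")[-1]
    return median(totals), p95
```

Different quantile conventions can give slightly different values on small or heavily skewed samples, so the convention should be fixed before comparing numbers across analyses.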
The citation trends of the 274 core VIPs from 1998 to 2023 are shown in Figs. 3 and 4. The citation analysis revealed that SIFT and PolyPhen, which are among the earliest VIPs, are the most cited core VIPs (Figs. 3 and 4).

Discussion and Conclusions
VIPdb version 2 provides a comprehensive view of VIPs. To identify the most appropriate VIPs for their specific needs, users are advised to thoroughly assess the strengths and weaknesses of VIPs before determining their suitability for use.
In summary, VIPdb version 2 presents a collection of 403 VIPs developed over the last 25 years, with their characteristics, citation patterns, publication details, and access information. VIPdb version 2 is publicly accessible at https://genomeinterpretation.org/vipdb
bioRxiv preprint (not certified by peer review), this version posted June 28, 2024; doi: https://doi.org/10.1101/2024.06.25.600283. Made available under a CC-BY 4.0 International license.

Results
We incorporated 186 additional VIPs into VIPdb version 2, alongside the 217 VIPs in the previous version of VIPdb. We summarized the characteristics of the 403 VIPs in VIPdb version 2. Among these, 274 are core VIPs, defined as VIPs that are primarily designed for variant impact prediction and are not databases.

Figure 2. Citation and publication analysis of 403 VIPs. (a) Citations each year for 274